Reuters Dataset

A Multi-class Classification Example

Introduction

The Reuters Dataset contains a set of short newswires and their topics, published by Reuters in 1986. Each topic has at least 10 examples in the training set. Our job is to classify Reuters newswires into 46 different mutually exclusive topics. There are 8982 training examples and 2246 testing examples.

Code

library(keras)

reuters = dataset_reuters(num_words = 10000) 
# 10,000 most frequently occurring words

c(c(train_data, train_labels), c(test_data,test_labels)) %<-% reuters

train_labels[[1]] 
# label associated with an example is an integer between 0 and 45 - a topic index
test_labels[[1]] 
# 1st test label is a topic index of 3 - we will predict this later

Data Preparation

Code

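vectorize_sequences <- function(sequences, dimension = 10000) {
  # one row per sequence, one column per word index
  results <- matrix(0, nrow = length(sequences), ncol = dimension)
  for (i in seq_along(sequences)) {
    # mark every word index that occurs in sequence i
    results[i, sequences[[i]]] <- 1
  }
  results
}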
x_train <- vectorize_sequences(train_data)
x_test <- vectorize_sequences(test_data)
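
To see what this multi-hot encoding produces, the same function can be run on a tiny input. The two sequences and `dimension = 5` below are made-up values for illustration, not taken from the Reuters data.

```r
# Same encoder as above, redefined so this snippet runs on its own
vectorize_sequences <- function(sequences, dimension = 10000) {
  results <- matrix(0, nrow = length(sequences), ncol = dimension)
  for (i in seq_along(sequences)) {
    # set the columns named by the word indices of sequence i to 1
    results[i, sequences[[i]]] <- 1
  }
  results
}

# Hypothetical toy input: two "newswires" over a 5-word vocabulary
toy_sequences <- list(c(1, 3), c(2, 2, 5))
toy_matrix <- vectorize_sequences(toy_sequences, dimension = 5)
print(toy_matrix)
```

A word that appears twice (index 2 in the second sequence) still yields a single 1, so the encoding records presence rather than counts.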

To vectorize the labels, there are two possibilities: (1) cast the label list as an integer tensor, or (2) use one-hot encoding.

One-hot encoding is a widely used format for categorical data, also called categorical encoding. Here, one-hot encoding of the labels consists of embedding each label as an all-zero vector with a 1 in the place of the label index.

Code

one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)
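
The same idea can be written by hand in base R, which makes the index shift explicit: labels run from 0 to 45, while R columns run from 1 to 46. This is a minimal sketch; the helper name `to_one_hot` and the toy labels are assumptions made for illustration.

```r
# Hand-rolled one-hot encoder for 0-based integer labels (illustrative helper)
to_one_hot <- function(labels, num_classes = 46) {
  results <- matrix(0, nrow = length(labels), ncol = num_classes)
  for (i in seq_along(labels)) {
    # label k (0-based) goes into column k + 1 because R indexes from 1
    results[i, labels[[i]] + 1] <- 1
  }
  results
}

toy_labels <- c(3, 0, 45)          # made-up topic indices
toy_one_hot <- to_one_hot(toy_labels)
dim(toy_one_hot)                   # 3 46
which(toy_one_hot[1, ] == 1)       # 4, i.e. topic index 3
```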

Network Structure

Here the output has 46 classes, so the dimensionality of the output space is much larger than in the earlier example, which had only 2 classes.

In a stack of dense layers, each layer can only access information present in the output of the previous layer. If one layer drops some information relevant to the classification problem, this information can never be recovered by later layers: each layer can potentially become an information bottleneck. In the previous example we used 16-dimensional intermediate layers, but a 16-dimensional space may be too limited to learn to separate 46 different classes; such small layers may act as information bottlenecks, permanently dropping relevant information.

Code

# solution -> use 64 units
model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = c(10000)) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 46, activation = "softmax")

summary(model)
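
The parameter counts printed by `summary(model)` can be checked by hand: a dense layer with n inputs and m units has n * m weights plus m biases. The helper name `dense_params` is made up for this sketch.

```r
# Parameters of a dense layer: one weight per input-unit pair plus one bias per unit
dense_params <- function(n_inputs, n_units) n_inputs * n_units + n_units

# Input dimension followed by the number of units in each layer
layer_sizes <- c(10000, 64, 64, 46)
params <- dense_params(head(layer_sizes, -1), tail(layer_sizes, -1))
params        # 640064 4160 2990
sum(params)   # 647214
```

Almost all of the parameters sit in the first layer, because it connects the 10,000-dimensional input to the 64 hidden units.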

We end the network with a dense layer of size 46 because we have to classify into 46 different classes. The use of softmax activation means that the network will output a probability distribution over the 46 different output classes.
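
As a rough illustration of what softmax does, the base-R sketch below turns a vector of raw scores into probabilities that are positive and sum to 1. The function and the three scores are illustrative assumptions; Keras applies the same transformation to the 46 outputs of the last layer.

```r
# Numerically stable softmax: subtract the max before exponentiating
softmax <- function(z) {
  e <- exp(z - max(z))
  e / sum(e)
}

scores <- c(2.0, 1.0, 0.1)   # made-up raw scores for three classes
probs <- softmax(scores)
probs
sum(probs)                   # 1
which.max(probs)             # 1
```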

Compile

Code

model %>% compile(optimizer="rmsprop",
                  loss="categorical_crossentropy",
                  metrics=c("accuracy"))

Creating Validation

Code

val_indices = 1:1000  # 1000 samples

x_val = x_train[val_indices, ]
y_val = one_hot_train_labels[val_indices,]

partial_x_train = x_train[-val_indices,]
partial_y_train = one_hot_train_labels[-val_indices,]

Training

Code

history = model %>% fit(partial_x_train, partial_y_train, 
                        epochs=20,batch_size=512,
                        validation_data=list(x_val, y_val))

The network begins to overfit after 9 epochs.
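
One simple way to pick this epoch is to take the one with the lowest validation loss. The sketch below uses a made-up `val_loss` vector, not the values from the run above; with a real run, the per-epoch validation losses would normally be read from `history$metrics$val_loss`.

```r
# Made-up validation losses for 12 epochs (illustrative only, not from the run above)
val_loss <- c(1.70, 1.30, 1.12, 1.03, 0.97, 0.93, 0.91, 0.90, 0.89, 0.91, 0.94, 0.98)

# Epoch with the lowest validation loss; later epochs only get worse here
best_epoch <- which.min(val_loss)
best_epoch   # 9
```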

Code

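# retrain from scratch for 9 epochs: calling fit() again on the model above
# would continue from its 20-epoch weights, so rebuild and recompile first
model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = c(10000)) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 46, activation = "softmax")
model %>% compile(optimizer="rmsprop", loss="categorical_crossentropy",
                  metrics=c("accuracy"))
history = model %>% fit(partial_x_train, partial_y_train, 
                        epochs=9, batch_size=512,
                        validation_data = list(x_val, y_val))
# reported: training accuracy = 96.64%, validation accuracy = 79%
results = model %>% evaluate(x_test, one_hot_test_labels) 
# reported: test accuracy = 78.54%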

The accuracy is around 78-79%. For a balanced binary classification problem, a purely random classifier would reach 50% accuracy.

Code

# Purely random classifier: shuffle the test labels and compare with the originals
test_labels_copy = sample(test_labels)
mean(test_labels == test_labels_copy) # 18%

But in this case it is closer to 18%, so the results look good, at least compared with a random baseline.
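
The shuffled-label baseline has a simple expected value: if class k makes up a fraction p_k of the labels, a random shuffle matches the true label with probability equal to the sum of p_k squared. The sketch below checks this on made-up labels; `toy_labels` and its class frequencies are assumptions for illustration.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Made-up imbalanced labels: three classes with frequencies 60%, 30%, 10%
toy_labels <- rep(c(0, 1, 2), times = c(600, 300, 100))

# Expected accuracy of a shuffled-label classifier: sum of squared class frequencies
class_freq <- as.numeric(table(toy_labels)) / length(toy_labels)
expected_acc <- sum(class_freq^2)
expected_acc   # 0.46

# Empirical check: average accuracy over many shuffles
shuffle_acc <- replicate(200, mean(sample(toy_labels) == toy_labels))
mean(shuffle_acc)
```

With 46 perfectly balanced classes the same formula would give 1/46, about 2.2%, so a baseline near 18% indicates that a few topics account for a large share of the newswires.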

Predictions

Code

predictions = model %>% predict(x_test)

dim(predictions) # 2246 46
# each row is a probability distribution over the 46 topics,
# one row per test example

which.max(predictions[1,]) # 4
# R columns are 1-based, so column 4 corresponds to topic index 3,
# which matches the actual label test_labels[[1]] seen above

predictions[1,4]
# column 4 (topic index 3) is predicted with a probability of 98.499%

predictions[1,3]
# column 3 (topic index 2) is predicted with a probability of 0.0001%
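
To turn a whole prediction matrix into topic indices, take the column with the highest probability in each row and subtract 1, since R columns are 1-based and the topic indices are 0-based. The matrix and labels below are made up for illustration.

```r
# Made-up probability matrix: 3 examples (rows) by 4 classes (columns)
toy_predictions <- rbind(
  c(0.10, 0.70, 0.15, 0.05),
  c(0.60, 0.20, 0.10, 0.10),
  c(0.05, 0.05, 0.10, 0.80)
)

# which.max gives a 1-based column; subtract 1 to get the 0-based class index
predicted_classes <- apply(toy_predictions, 1, which.max) - 1
predicted_classes                        # 1 0 3

# Accuracy against made-up true labels
toy_true <- c(1, 0, 2)
mean(predicted_classes == toy_true)      # 0.6666667
```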

Model with Information Bottleneck

Code

model = keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", 
              input_shape = c(10000)) %>%
  layer_dense(units = 4, activation = "relu") %>%
  layer_dense(units = 46, activation = "softmax")

model %>% compile(optimizer="rmsprop",loss="categorical_crossentropy",
                  metrics=c("accuracy"))

model %>% fit(partial_x_train, partial_y_train, epochs=20, batch_size=128,
              validation_data = list(x_val, y_val))

The network peaks at 71% validation accuracy, an absolute drop of 8 percentage points, mostly because we are trying to compress a lot of information into an intermediate space that is too low-dimensional. The network is able to cram most of the necessary information into these 4-dimensional representations, but not all of it.

Structural Experimenting

Code

model = keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", 
              input_shape = c(10000)) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 46, activation = "softmax")

model %>% compile(optimizer="rmsprop",
                  loss="categorical_crossentropy",
                  metrics=c("accuracy"))

history = model %>% fit(partial_x_train, partial_y_train, 
                        epochs=9, batch_size=512,
                        validation_data = list(x_val, y_val))

results = model %>% evaluate(x_test, one_hot_test_labels)

Hidden Layers    Units per Hidden Layer    Accuracy
2                32, 32                    76.58% (slightly lower)
2                64, 32                    77.69% (close)
3                128, 128, 64              77.52% (lower)
3                64, 64, 64                77.03% (lower)